Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Vishnuprasad ., Paul Martin, Salman Nazeer, Prof. Vydehi K
DOI Link: https://doi.org/10.22214/ijraset.2023.53578
Meeting transcripts produced by tools like Microsoft Teams and Google Meet are useful for recording the discussions and decisions made during meetings. However, reading through long transcripts is time-consuming and is rarely the most efficient way to grasp the key points and conclusions of a meeting. Meeting summarization is a subfield of natural language processing (NLP) that extracts important information from meeting transcripts and generates a concise summary. This summary can be used to quickly understand the key points and conclusions of the meeting, and is especially useful for stakeholders who could not attend in person. Several NLP techniques can be used to create summaries of meeting transcripts, such as the term frequency-inverse document frequency (TF-IDF) method, the PageRank algorithm, named entity recognition, topic modeling, and dedicated summarization algorithms. Each technique has its own advantages and limitations, and the appropriate one can be chosen based on the organization's specific needs, such as accuracy, efficiency, and customization.
I. INTRODUCTION
Document summarization is the process of creating a shorter version of a document that captures its most important information. This is useful for a variety of purposes, such as quickly conveying the main points to a reader who lacks the time to read the full document, or for organizing and indexing large document collections.
There are two main approaches to document summarization: extractive and abstractive. Extractive summarization techniques use mathematical and statistical methods to identify the most important words, phrases, or sentences in the original document and include them in the summary. These techniques do not generate new text, but rather extract and compile the most important information from the original document. One example of an extractive summarization technique is the singular value decomposition (SVD) method. SVD is a mathematical technique that decomposes a matrix into its constituent parts, and it can be used to identify the most important words or phrases in a document based on their frequency and co-occurrence within the document. Other extractive techniques include keyword extraction, which identifies the most important words in a document based on their frequency or importance, and sentence extraction, which selects the most important sentences from the original document. Abstractive summarization techniques, on the other hand, use language semantics and natural language generation (NLG) to produce new text and summaries based on the content of the original document. These techniques aim to capture the main ideas and concepts of the original document, rather than just the specific words and phrases used.
One example of an abstractive summarization technique is the use of a knowledge base, which is a collection of facts and information about a specific domain. A summarization system that uses a knowledge base can generate a summary by selecting the most relevant facts from the knowledge base and combining them into a coherent summary. Another example of an abstractive technique is the use of semantic representations, which are structured representations of the meaning of words and phrases in a document. A summarization system that uses semantic representations can generate a summary by identifying the most important concepts in a document and expressing them in a way that is easy for a reader to understand.
There are several factors to consider when choosing an approach to document summarization. Extractive techniques are generally simpler and faster to implement, but they may not capture the main ideas of the original document as accurately as abstractive techniques. Abstractive techniques, on the other hand, can produce more coherent and accurate summaries, but they may require more data and computational resources.
In this paper, the focus is on summarizing transcripts of meetings, such as those from Microsoft Teams. Meeting transcripts can be particularly long and complex, as they often include multiple speakers discussing a variety of topics. As a result, it can be helpful to generate a summary of a meeting transcript to quickly convey the main points and decisions made during the meeting.
There are a number of techniques that can be used to summarize meeting transcripts. One such technique is TextRank, which is an extractive summarization technique that uses a graph-based algorithm to identify the most important words and phrases in a document. TextRank works by creating a graph of the words in a document, with edges between words that are frequently co-occurring. The most important words are then identified based on their centrality in the graph.
Another technique that can be used to summarize meeting transcripts is term frequency-inverse document frequency (TF-IDF). TF-IDF is a statistical measure that reflects the importance of a word in a document based on its frequency within the document and its rarity across a collection of documents.

Another factor to consider when summarizing meeting transcripts is the level of detail desired in the summary. Some summaries are intended to capture only the most important points and decisions made during the meeting, while others require a more detailed account that includes all of the main points and discussions.

There are also a number of tools and software platforms that can assist with the summarization of meeting transcripts. These tools often use a combination of extractive and abstractive techniques to produce summaries, and they may also allow users to customize the level of detail in the summary.
One example of a tool that can be used to summarize meeting transcripts is Microsoft Teams, which is a platform for online meetings and collaboration. Microsoft Teams includes a feature called "Meeting Notes" that allows users to take notes during a meeting and generate a summary of the meeting afterwards. The Meeting Notes feature uses a combination of extractive and abstractive techniques to produce the summary, and it also allows users to customize the level of detail in the summary.
In conclusion, document summarization is a useful tool for quickly conveying the main points of a document to a reader. There are two main approaches: extractive and abstractive, each with its own strengths and limitations. Meeting transcripts, in particular, can be long and complex, and tools such as Microsoft Teams can assist with summarizing these transcripts.
II. PROPOSED SYSTEM
The proposed architecture for the Meeting Summarizer utilizing NLP involves taking a transcript document as input and conducting various operations to generate a summary document. Specifically designed for Microsoft Teams transcripts, the architecture begins by removing time stamps and associating each sentence with its corresponding speaker. The text is then split into individual sentences for further processing.
During pre-processing, the text undergoes standardization and stop words are eliminated. The document term feature matrix is constructed using TF-IDF (Term Frequency-Inverse Document Frequency). Term Frequency denotes the frequency of a word in a document, while Inverse Document Frequency indicates the rarity or commonality of a word across all documents. TF-IDF scores are obtained by multiplying these two factors, determining the significance of a word within a document.
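As a minimal sketch of this step (assuming scikit-learn's TfidfVectorizer; the example sentences are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-processed sentences from a meeting transcript
sentences = [
    "the team agreed to ship the release on friday",
    "testing of the release build starts tomorrow",
    "marketing will announce the release next week",
]

# Each row is one sentence, each column one vocabulary term;
# entries are the TF-IDF scores of that term in that sentence
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(sentences)

print(tfidf_matrix.shape)  # (number of sentences, vocabulary size)
```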
Next, a document similarity matrix is created by multiplying the document term feature matrix with its transpose. This matrix captures the similarities between each pair of sentences. A document similarity graph is then generated, with sentences as vertices and the similarity scores as weight or score coefficients.
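A small sketch of this multiplication (scikit-learn's TF-IDF rows are L2-normalized by default, so the product of the matrix with its transpose yields pairwise cosine similarities; the sentences below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "the budget was approved by the board",
    "the board approved the new budget",
    "lunch will be served at noon",
]

tfidf = TfidfVectorizer().fit_transform(sentences)

# Rows are unit-length vectors, so M @ M.T holds pairwise cosine similarities
similarity_matrix = (tfidf @ tfidf.T).toarray()

# Diagonal entries are 1.0: each sentence is maximally similar to itself
print(similarity_matrix.round(2))
```

The first two sentences share most of their words, so their off-diagonal entry is high, while the unrelated third sentence scores near zero against both.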
The PageRank algorithm, which assigns scores based on the importance of nodes in a network, is applied to the document similarity graph. It calculates scores for each sentence, indicating their relative significance within the overall network.
Finally, the sentences are ranked based on their scores, and the top sentences are selected to form the output summary document. These sentences represent the most relevant and informative portions of the original meeting transcript.
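The ranking and selection steps above can be sketched end to end (a minimal illustration rather than the authors' exact implementation; the transcript sentences are hypothetical, and scikit-learn and NetworkX are assumed to be available):

```python
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "we decided to move the launch date to the first week of june",
    "the launch depends on the final security review",
    "the security review is scheduled for the last week of may",
    "someone mentioned the weather was nice today",
]

# TF-IDF rows are unit vectors, so the product with the transpose is cosine similarity
tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
similarity = (tfidf @ tfidf.T).toarray()
np.fill_diagonal(similarity, 0.0)  # drop self-similarity so it does not bias the ranking

# Build a weighted sentence graph and score each sentence with PageRank
graph = nx.from_numpy_array(similarity)
scores = nx.pagerank(graph)

# Keep the top-ranked sentences, restoring original order for readability
top = sorted(scores, key=scores.get, reverse=True)[:2]
summary = [sentences[i] for i in sorted(top)]
print(summary)
```

The off-topic fourth sentence shares no content words with the rest, so it receives a low score and is excluded from the summary.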
The flowchart begins by taking the transcript document as input. It then proceeds with extracting sentences and tokenizing them for further use. The TF-IDF technique is applied to generate a document similarity matrix. This matrix is then utilized by the TextRank Algorithm, which assigns rankings to each sentence. Finally, the top-ranked sentences are selected to create the output summary document.
The transcript is tokenized using NLTK, which splits it into individual words. Stop words are then removed, and stemming reduces words to their root form. The resulting normalized sentences are represented as NumPy vectors. The TfidfVectorizer from the sklearn module builds the document term frequency matrix, and multiplying this matrix by its transpose yields the sentence similarity matrix. The number of sentences in the output summary depends on the number of pre-processed sentences in the input: if there are more than 30 sentences, the summary contains 20% of the input sentences; otherwise, 30% is used. The PageRank algorithm is applied to the similarity graph, ranking each sentence by importance, and the top-ranked sentences are provided as the output.
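The sentence-count rule described above can be written as a small helper (a sketch; the 30-sentence threshold and the 20%/30% ratios are taken directly from the description, and the minimum of one sentence is an added safeguard):

```python
def summary_length(num_sentences: int) -> int:
    """Number of sentences to keep in the summary: 20% of the input if it
    has more than 30 sentences, otherwise 30%, but always at least one."""
    ratio = 0.2 if num_sentences > 30 else 0.3
    return max(1, round(num_sentences * ratio))

print(summary_length(100))  # 20
print(summary_length(20))   # 6
```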
III. TECHNOLOGIES USED
A. TextRank Algorithm
TextRank is a graph-based ranking algorithm inspired by Google's PageRank algorithm. It can be used to identify the most relevant sentences in a text and to extract keywords, and it has a number of applications in natural language processing, such as keyword extraction, automatic text summarization, and phrase ranking. To identify the most relevant sentences in a text, TextRank creates a graph with vertices representing each sentence in the document and edges linking sentences based on content overlap, which can be measured by counting the words two sentences share. The PageRank algorithm is then run over this graph to determine the importance of each sentence within the network of sentences. The most important sentences are selected and used to create a summary of the text.
TextRank can also be used to extract keywords from a text by creating a word network that identifies which words are connected to one another. If two words frequently appear next to each other in the text, a link is created between them, and the link is given more weight the more often the words co-occur. The PageRank algorithm is applied to the resulting network to determine the significance of each word. The top third of the most significant words are selected and used to create a keywords table by grouping together relevant terms that appear close to each other in the text. The TextRank architecture thus involves creating a graph of the text, applying the PageRank algorithm to the graph, and using the resulting scores to rank the phrases or words. The most important phrases or words are then selected for the desired task, such as summarization or keyword extraction.
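The keyword-extraction variant can be sketched as follows (a toy illustration with made-up text, assuming NetworkX; the co-occurrence window is simplified to adjacent words, and stop-word filtering is omitted for brevity):

```python
import networkx as nx

text = ("natural language processing enables meeting summarization "
        "and meeting summarization relies on natural language processing")
words = text.split()

# Link words that appear next to each other; repeated co-occurrence adds weight
graph = nx.Graph()
for a, b in zip(words, words[1:]):
    if graph.has_edge(a, b):
        graph[a][b]["weight"] += 1
    else:
        graph.add_edge(a, b, weight=1)

# PageRank over the word network scores each word's significance
scores = nx.pagerank(graph)
keywords = sorted(scores, key=scores.get, reverse=True)[:4]
print(keywords)
```

Words that repeatedly co-occur with many neighbors, such as "language" here, accumulate heavier edges and therefore rank above incidental words like "enables".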
B. Term Frequency – Inverse Document Frequency (TF-IDF)
The term frequency-inverse document frequency (TF-IDF) technique is a method for evaluating the relevance of a word to a document in a collection of documents. It is widely used in information retrieval and natural language processing tasks such as document search and classification. The TF-IDF measure is based on the idea that a word occurring frequently in a document is likely to be important to the meaning of that document, while a word occurring frequently across many documents is less useful for determining a document's relevance.

To balance these two factors, TF-IDF combines the term frequency, which is the raw count of the number of times a word appears in a document, with the inverse document frequency, which measures how common or rare the word is in the collection. The term frequency of a word is simply the number of times the word appears in the document. The inverse document frequency is the logarithm of the total number of documents in the collection divided by the number of documents containing the word; it downweights common words that occur in many documents. The TF-IDF weight of a word or phrase is the product of its term frequency and inverse document frequency: the larger the weight, the rarer the word or phrase is across the collection, and the more likely it is to be relevant to the meaning of the document.
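The calculation described above can be reproduced directly (a sketch with hypothetical documents, using the common log(N/df) variant of inverse document frequency; libraries such as scikit-learn apply smoothing variants of the same formula):

```python
import math

# Hypothetical tokenized documents
documents = [
    ["budget", "review", "meeting"],
    ["budget", "approval"],
    ["team", "meeting", "notes"],
]

def tf_idf(word, doc, docs):
    tf = doc.count(word)                    # raw count in this document
    df = sum(1 for d in docs if word in d)  # documents containing the word
    idf = math.log(len(docs) / df)          # rarer words get a higher weight
    return tf * idf

# "budget" appears in 2 of 3 documents; "approval" in only 1,
# so "approval" receives the higher weight within document 1
print(tf_idf("budget", documents[1], documents))
print(tf_idf("approval", documents[1], documents))
```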
C. GloVe Embedding
GloVe, which stands for Global Vectors for Word Representation, is an unsupervised learning algorithm for generating word embeddings. Word embeddings are dense vector representations of words, where each word is mapped to a high-dimensional vector. These vectors capture semantic and syntactic relationships between words, allowing for a better understanding of their meanings.

The GloVe algorithm is based on the idea that word meaning can be inferred from the co-occurrence statistics of words in a large corpus of text. It analyzes the word co-occurrence matrix, which counts how often words appear together within a given context window. By factorizing this matrix, GloVe learns the embeddings that best capture the statistical patterns of word co-occurrence.

The resulting embeddings encode semantic relationships between words: similar words are represented by vectors that are close together in the embedding space, while dissimilar words are far apart. These embeddings can be used as features in various natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, and text classification. GloVe embeddings have gained popularity due to their effectiveness in capturing semantic information and the broad vocabulary coverage of their pretrained releases, and they are widely used in both research and industry to enhance the performance of NLP models.
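A small sketch of how such embeddings are used (the three-dimensional vectors below are made up for illustration; real GloVe vectors have 50 to 300 dimensions and are loaded from a pretrained file such as glove.6B):

```python
import numpy as np

# Toy vectors standing in for pretrained GloVe embeddings
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.75, 0.70, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine(a, b):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words sit close together in the embedding space
print(cosine(embeddings["king"], embeddings["queen"]))  # high
print(cosine(embeddings["king"], embeddings["apple"]))  # much lower
```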
IV. RELATED WORKS
V. RESULT
The output of the Meeting Summarizer, built using Natural Language Processing (NLP), TF-IDF, and the PageRank algorithm, is a concise and informative summary of the meeting. The system processes the meeting transcript, removes irrelevant information, and extracts key sentences using NLP techniques. The TF-IDF approach determines the importance of words in the document, and the PageRank algorithm assigns scores to sentences based on their relevance in the context of the meeting. The top-ranked sentences identified through these techniques are selected to form the output summary. This approach allows for efficient extraction of essential information, facilitating better understanding and decision-making based on the meeting content.
VI. FUTURE SCOPE
The future scope of Meeting Summarizer using Natural Language Processing (NLP) is promising. Advancements may include multi-modal summarization by integrating audio, video, and text, enhancing speaker identification and tracking for personalized summaries. Contextual understanding, such as sentiment analysis and entity recognition, can lead to more accurate summaries. Real-time summarization for live meetings, user customization, and integration with collaboration tools offer improved productivity. Robust evaluation metrics are needed to assess summary quality. Multilingual support can broaden its applicability. These advancements aim to optimize information extraction, knowledge management, and decision-making in organizational settings, leading to more efficient and effective meeting summaries.
VII. ACKNOWLEDGEMENT
First, we wish to express our sincere gratitude to our project guide, Prof. Vydehi K, for her enthusiasm, patience, insightful comments, practical advice and unceasing ideas, which have helped us tremendously at all times during our research. Her immense knowledge, profound experience, and professional expertise in NLP have enabled us to complete this research successfully, which would not have been possible without her support and guidance. We also wish to express our sincere thanks to Adi Shankara Institute of Engineering and Technology for its consistent support.
VIII. CONCLUSION
Extractive summarization is a natural language processing task that involves selecting important sentences or phrases from a document and including them in a summary. It is commonly used for summarizing meeting transcripts, as it allows readers to quickly grasp the key points and conclusions of a meeting without reading the entire transcript. Several algorithms can be used for extractive summarization, such as the TextRank algorithm and the term frequency-inverse document frequency (TF-IDF) method. It is also important to preprocess the transcript to improve the quality of the text before generating the summary; this can involve tokenization, lemmatization, and stopword removal. While extractive summarization is a useful tool for quickly understanding the content of a meeting transcript, it does not always generate the most concise summary, as it may include irrelevant or redundant material from the original document. Abstractive summarization, on the other hand, generates new sentences that capture the main points of the original document and can produce more concise summaries. However, it is generally more challenging to implement, as it requires the system to understand the meaning of the text and to generate new sentences.
REFERENCES
[1] Yash Agrawal, Atul Thakre, Tejas Tapas, Ayush Kedia, Yash Telkhade and Vasundhara Rathod, "Comparative analysis of NLP models for Google Meet transcript summarization," EasyChair Preprint no. 5404, 2021.
[2] Yuanfeng Song, Di Jiang, Xuefang Zhao, Xiaoling Huang, Qian Xu, Raymond Chi-Wing Wong and Qiang Yang, "SmartMeeting: Automatic meeting transcription and summarization for in-person conversations," ACM International Conference on Multimedia, 2021, pp. 2777-2779.
[3] Pratik K. Biswas and Aleksandr Iakubovich, "Extractive summarization of call transcripts," arXiv preprint arXiv:2103.10599, 19 March 2021.
[4] Ujjwal Rani and Karambir Bidhan, "Comparative assessment of extractive summarization: TextRank, TF-IDF and LDA," Journal of Scientific Research 65.1 (2021): 304-311.
[5] A. Nenkova and K. McKeown, "A survey of text summarization techniques," in C. Aggarwal and C. Zhai (eds), Mining Text Data, Springer, Boston, MA, 2012.
[6] Aravind Chandramouli, Siddharth Shukla, Neeti Nair, Shiven Purohit, Shubham Pandey and Murali Mohana Krishna Dandu, "Unsupervised paradigm for information extraction from transcripts using BERT," arXiv preprint arXiv:2110.00949, 13 September 2021.
[7] L. Yao, Z. Pengzhou and Z. Chi, "Research on news keyword extraction technology based on TF-IDF and TextRank," IEEE/ACIS 18th International Conference on Computer and Information Science (ICIS), June 2019.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, "BERT: Pre-training of deep bidirectional Transformers for language understanding," arXiv preprint arXiv:1810.04805.
[9] Xingxing Zhang, Mirella Lapata, Furu Wei and Ming Zhou, "Neural latent extractive document summarization," Conference on Empirical Methods in Natural Language Processing, 2018.
[10] C. Mallick, A. K. Das, M. Dutta, A. K. Das and A. Sarkar, "Graph-based text summarization using modified TextRank," in Soft Computing in Data Analytics, Advances in Intelligent Systems and Computing, vol. 758, Springer, Singapore, 2019.
[11] Yue Dong, Andrei Romascanu and Jackie C. K. Cheung, "HipoRank: Incorporating hierarchical and positional information into graph-based unsupervised long document extractive summarization," arXiv preprint arXiv:2005.00513, 2020.
[12] Derek Miller, "Leveraging BERT for extractive text summarization on lectures," arXiv preprint arXiv:1906.04165, 7 July 2019.
[13] Wen Xiao and Giuseppe Carenini, "Extractive summarization of long documents by combining global and local context," arXiv preprint arXiv:1909.08089, 17 September 2019.
[14] Manling Li, Lingyu Zhang, Richard J. Radke and Heng Ji, "Keep meeting summaries on topic: Abstractive multi-modal meeting summarization," 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[15] Shashi Narayan, Shay B. Cohen and Mirella Lapata, "Ranking sentences for extractive summarization with reinforcement learning," arXiv preprint arXiv:1802.08636, 23 February 2018.
Copyright © 2023 Vishnuprasad ., Paul Martin, Salman Nazeer, Prof. Vydehi K. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET53578
Publish Date : 2023-06-01
ISSN : 2321-9653
Publisher Name : IJRASET